Fix repeated mangled names in read_csv with duplicate column names #8645

Merged

karthikeyann commented Jul 2, 2021

Fixes a bug in read_csv where mangling duplicate column names could produce names that repeat other column names in the header.
The resulting column names did not match pandas behavior.

csv file:

A,A,A.1,A,A.2,A,A.4,A,A
1,2,3,4.0,a,a,a.4,a,a
2,4,6,8.0,b,b,b.4,b,a
3,6,2,6.0,c,c,c.4,c,c

Original names: A A A.1 A A.2 A A.4 A A
Mangled names:  A A.1 A.1.1 A.2 A.2.1 A.3 A.4 A.4.1 A.5

Pandas:

In [1]: import pandas as pd
In [2]: pd.read_csv("test.csv")
Out[2]: 
   A  A.1  A.1.1  A.2 A.2.1 A.3  A.4 A.4.1 A.5
0  1    2      3  4.0     a   a  a.4     a   a
1  2    4      6  8.0     b   b  b.4     b   a
2  3    6      2  6.0     c   c  c.4     c   c

cudf (21.08 nightly Docker image):

In [1]: import cudf
In [2]: cudf.__version__
Out[2]: '21.08.00a+238.gfba09e66d8'
In [3]: cudf.read_csv("test.csv")
Out[3]: 
   A  A.1 A.2 A.3 A.4 A.5
0  1    3   a   a   a   a
1  2    6   b   b   b   a
2  3    2   c   c   c   c

With this PR, cudf produces the same column names as pandas:

In [2]: cudf.read_csv("test.csv")
Out[2]: 
   A  A.1  A.1.1  A.2 A.2.1 A.3  A.4 A.4.1 A.5
0  1    2      3  4.0     a   a  a.4     a   a
1  2    4      6  8.0     b   b  b.4     b   a
2  3    6      2  6.0     c   c  c.4     c   c
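
The naming shown above can be reproduced with a simple counting scheme. The following is a minimal sketch in plain Python, not the libcudf change in this PR and not necessarily how pandas implements it; the function name `mangle_dupe_names` is hypothetical. It assumes that each repeated name gets a `.<n>` suffix, and that a suffixed candidate which itself collides with an existing name is mangled again.

```python
from collections import defaultdict


def mangle_dupe_names(names):
    """Return a copy of `names` in which every entry is unique.

    Hypothetical helper for illustration only.
    """
    counts = defaultdict(int)  # how many times each name has been claimed
    result = []
    for name in names:
        seen = counts[name]
        # While the candidate is already taken, bump its counter and
        # derive a new candidate by appending ".<count>".
        while seen > 0:
            counts[name] = seen + 1
            name = f"{name}.{seen}"
            seen = counts[name]
        counts[name] = seen + 1
        result.append(name)
    return result


header = ["A", "A", "A.1", "A", "A.2", "A", "A.4", "A", "A"]
print(mangle_dupe_names(header))
# ['A', 'A.1', 'A.1.1', 'A.2', 'A.2.1', 'A.3', 'A.4', 'A.4.1', 'A.5']
```

The inner loop is what handles the collision case in this example: the second `A` becomes `A.1`, so the literal `A.1` that follows has to become `A.1.1` rather than repeating a name.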

Related info (Spark):
Spark's duplicate column naming is discussed in:
https://issues.apache.org/jira/browse/SPARK-16896
apache/spark#14745
The cudf Spark add-on doesn't use libcudf column names, so this PR does not affect it.

karthikeyann requested a review from a team as a code owner July 2, 2021 14:51
github-actions bot added the libcudf label Jul 2, 2021
karthikeyann added the 3 - Ready for Review, bug, cuIO, and non-breaking labels and removed the libcudf label Jul 2, 2021

codecov bot commented Jul 2, 2021

Codecov Report

Merging #8645 (b9a4c9e) into branch-21.08 (fba09e6) will increase coverage by 0.01%.
The diff coverage is n/a.

❗ Current head b9a4c9e differs from pull request most recent head 81494a9. Consider uploading reports for the commit 81494a9 to get more accurate results

@@               Coverage Diff                @@
##           branch-21.08    #8645      +/-   ##
================================================
+ Coverage         10.60%   10.61%   +0.01%     
================================================
  Files               109      109              
  Lines             18280    18645     +365     
================================================
+ Hits               1938     1980      +42     
- Misses            16342    16665     +323     
Impacted Files                    Coverage Δ
python/cudf/cudf/io/hdf.py        0.00% <0.00%> (ø)
python/cudf/cudf/io/orc.py        0.00% <0.00%> (ø)
python/cudf/cudf/_version.py      0.00% <0.00%> (ø)
python/cudf/cudf/core/abc.py      0.00% <0.00%> (ø)
python/cudf/cudf/api/types.py     0.00% <0.00%> (ø)
python/cudf/cudf/io/dlpack.py     0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py    0.00% <0.00%> (ø)
python/cudf/cudf/core/index.py    0.00% <0.00%> (ø)
python/cudf/cudf/io/feather.py    0.00% <0.00%> (ø)
python/cudf/cudf/io/parquet.py    0.00% <0.00%> (ø)
... and 44 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update fba09e6...81494a9.

vuule left a comment

Just remembered - please add a test case that exercises the behavior change.

karthikeyann requested a review from a team as a code owner July 5, 2021 08:44
vuule added the 5 - Ready to Merge and 4 - Needs cuDF (Python) Reviewer labels and removed the 3 - Ready for Review and 5 - Ready to Merge labels Jul 5, 2021

vuule commented Jul 6, 2021

rerun tests

vuule added the 5 - Ready to Merge label and removed the 4 - Needs cuDF (Python) Reviewer label Jul 6, 2021
karthikeyann commented

@gpucibot merge

ajschmidt8 changed the title from "Fix repeated mangled names in read_csv with duplicate column names" to "Fix repeated mangled names in read_csv with duplicate column names" Jul 6, 2021
rapids-bot bot merged commit d77ba82 into rapidsai:branch-21.08 Jul 6, 2021
Labels
5 - Ready to Merge, bug, cuIO, non-breaking

5 participants